Feat/external eval server #23

olesho · 2025-07-13T19:22:19Z

No description provided.

eval-server/logs/combined.log

eval-server/logs/error.log

eval-server/logs/evaluations.jsonl

Copilot

Pull Request Overview

This PR introduces a comprehensive external evaluation server system for testing AI Chat agent tools in DevTools. The feature enables external evaluation of AI Chat tools via WebSocket RPC communication between DevTools and a standalone evaluation server.

External Evaluation Infrastructure: Complete WebSocket-based evaluation server with client management, authentication, and tool execution
DevTools Integration: New evaluation configuration UI in Settings Dialog with connection management and testing capabilities
Comprehensive Test Suite: 25+ evaluation test cases covering web task agents, schema extractors, and various real-world scenarios

Reviewed Changes

Copilot reviewed 114 out of 118 changed files in this pull request and generated 7 comments.


File	Description
front_end/panels/ai_chat/ui/SettingsDialog.ts	Adds evaluation configuration UI with connection status, endpoint settings, and test functionality
front_end/panels/ai_chat/evaluation/EvaluationAgent.ts	Implements WebSocket client for receiving and executing evaluation requests from server
front_end/panels/ai_chat/evaluation/EvaluationProtocol.ts	Defines message protocol for client-server evaluation communication
front_end/panels/ai_chat/common/EvaluationConfig.ts	Manages evaluation configuration, connection state, and localStorage persistence
front_end/panels/ai_chat/common/WebSocketRPCClient.ts	Generic WebSocket RPC client with reconnection and error handling
eval-server/src/server.js	Main evaluation server with client management, authentication, and evaluation orchestration
eval-server/evals/web-task-agent/*.yaml	Comprehensive test cases for web task agent covering social media, search, booking, finance, etc.
front_end/panels/ai_chat/**/ErrorHandlingUtils.ts	Improves error logging consistency across multiple files

Files not reviewed (1)

eval-server/package-lock.json: Language not supported

Copilot · 2025-07-17T23:22:03Z

front_end/panels/ai_chat/ui/SettingsDialog.ts

+
+    // Set up periodic status updates every 2 seconds
+    const statusUpdateInterval = setInterval(updateConnectionStatus, 2000);
+


The status update interval is created but never cleaned up when the dialog is closed or destroyed. This could lead to memory leaks and unnecessary function calls. Consider storing the interval ID and clearing it in a cleanup function.

Suggested change

+    // Ensure the interval is cleared when the dialog is closed
+    this.addEventListener('close', () => {
+      clearInterval(statusUpdateInterval);
+    });
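Whether `SettingsDialog` actually dispatches a `close` event is not visible in this diff, so the listener above may not fire. As an alternative, here is a minimal sketch that keeps the timer handle in an object with an explicit `stop()` method, which the dialog could call from whatever teardown path it already has. `StatusPoller` is a hypothetical name, not something in the PR.

```typescript
// Hypothetical helper (not part of the PR): owns a polling timer so that
// it can be started once and stopped safely any number of times.
class StatusPoller {
  private timer: ReturnType<typeof setInterval> | null = null;
  private readonly callback: () => void;
  private readonly intervalMs: number;

  constructor(callback: () => void, intervalMs: number) {
    this.callback = callback;
    this.intervalMs = intervalMs;
  }

  // Starts polling; calling start() while already running is a no-op.
  start(): void {
    if (this.timer !== null) {
      return;
    }
    this.timer = setInterval(this.callback, this.intervalMs);
  }

  // Stops polling and drops the handle; safe to call more than once.
  stop(): void {
    if (this.timer === null) {
      return;
    }
    clearInterval(this.timer);
    this.timer = null;
  }

  get running(): boolean {
    return this.timer !== null;
  }
}
```

In the dialog this would look roughly like `const poller = new StatusPoller(updateConnectionStatus, 2000); poller.start();`, with `poller.stop()` called wherever the dialog is torn down.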

Copilot · 2025-07-17T23:22:03Z

front_end/panels/ai_chat/ui/SettingsDialog.ts

+    const clientIdInput = document.createElement('input');
+    clientIdInput.type = 'text';
+    clientIdInput.className = 'settings-input';
+    clientIdInput.value = currentEvaluationConfig.clientId || 'Auto-generated on first connection';


[nitpick] The fallback text 'Auto-generated on first connection' should be extracted to a constant or the UIStrings object for better maintainability and potential internationalization.

Suggested change

-    clientIdInput.value = currentEvaluationConfig.clientId || 'Auto-generated on first connection';
+    clientIdInput.value = currentEvaluationConfig.clientId || i18nString(UIStrings.autoGeneratedOnFirstConnection);

Copilot · 2025-07-17T23:22:03Z

front_end/panels/ai_chat/evaluation/EvaluationAgent.ts

+        });
+        // For new clients, the server created the config and asks to reconnect
+        // We can attempt to reconnect after a short delay
+        setTimeout(() => {


[nitpick] The hardcoded 1000ms delay for reconnection retry should be extracted to a constant or configuration parameter for better maintainability and testability.
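No suggested change was attached to this comment. One possible shape, sketched under assumptions: the delay becomes a named constant and the scheduling function is injectable, so a test can check the delay without waiting on a real timer. `RECONNECT_RETRY_DELAY_MS`, `scheduleReconnect`, and the `Scheduler` type are hypothetical names, not identifiers from `EvaluationAgent.ts`.

```typescript
// Hypothetical constant: the 1000 ms value comes from the review comment.
const RECONNECT_RETRY_DELAY_MS = 1000;

// A scheduler takes a callback and a delay; setTimeout fits this shape.
type Scheduler = (fn: () => void, delayMs: number) => unknown;

// Schedules a single reconnect attempt after delayMs.
// Tests can pass their own scheduler to observe the delay synchronously.
function scheduleReconnect(
  reconnect: () => void,
  delayMs: number = RECONNECT_RETRY_DELAY_MS,
  schedule: Scheduler = (fn, ms) => setTimeout(fn, ms),
): void {
  schedule(reconnect, delayMs);
}
```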

Copilot · 2025-07-17T23:22:04Z

front_end/panels/ai_chat/evaluation/EvaluationAgent.ts

+          timestamp: new Date().toISOString()
+        });
+      }
+    }, 30000); // Send ping every 30 seconds


[nitpick] The hardcoded 30000ms (30 seconds) heartbeat interval should be extracted to a constant or configuration parameter for better maintainability.

Suggested change

-    }, 30000); // Send ping every 30 seconds
+    }, HEARTBEAT_INTERVAL_MS); // Send ping every 30 seconds

Copilot · 2025-07-17T23:22:04Z

front_end/panels/ai_chat/common/WebSocketRPCClient.ts

+    }
+
+    this.currentReconnectAttempt++;
+    const delay = this.reconnectDelay * Math.pow(2, this.currentReconnectAttempt - 1);


The exponential backoff calculation could result in very large delays for higher retry attempts. Consider adding a maximum delay cap to prevent excessively long wait times.

Suggested change

-    const delay = this.reconnectDelay * Math.pow(2, this.currentReconnectAttempt - 1);
+    const MAX_RECONNECT_DELAY = 30000; // Maximum delay cap in milliseconds (30 seconds)
+    const delay = Math.min(this.reconnectDelay * Math.pow(2, this.currentReconnectAttempt - 1), MAX_RECONNECT_DELAY);
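To make the effect of the cap concrete, the same calculation can be pulled into a pure function and checked against a few attempt numbers. This is only a sketch: `computeReconnectDelay` and `MAX_RECONNECT_DELAY_MS` are hypothetical names, and the 1000 ms base delay in the usage lines is an assumed value, since the PR's actual `reconnectDelay` is not shown here.

```typescript
// Upper bound on the wait between reconnect attempts (30 seconds),
// matching the value proposed in the suggested change.
const MAX_RECONNECT_DELAY_MS = 30_000;

// Capped exponential backoff: base * 2^(attempt - 1), never above maxDelayMs.
// `attempt` is 1-based, mirroring currentReconnectAttempt after its increment.
function computeReconnectDelay(
  baseDelayMs: number,
  attempt: number,
  maxDelayMs: number = MAX_RECONNECT_DELAY_MS,
): number {
  const exponent = Math.max(0, attempt - 1);
  return Math.min(baseDelayMs * Math.pow(2, exponent), maxDelayMs);
}

// With an assumed base of 1000 ms:
console.log(computeReconnectDelay(1000, 1)); // 1000
console.log(computeReconnectDelay(1000, 3)); // 4000
console.log(computeReconnectDelay(1000, 6)); // 30000 (32000 uncapped)
```

Without the cap, attempt 10 with the same base would wait 512 seconds; with it, every attempt from the sixth onward waits 30 seconds.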

Copilot · 2025-07-17T23:22:04Z

eval-server/src/server.js

+          // Auto-create new client configuration
+          try {
+            logger.info('Auto-creating new client configuration', { clientId });
+            await this.clientManager.createClientWithId(clientId, `DevTools Client ${clientId.substring(0, 8)}`, 'hello');


The hardcoded default secret key 'hello' poses a security risk. Consider generating a strong random secret key or requiring explicit secret key configuration for new clients.

Suggested change

-            await this.clientManager.createClientWithId(clientId, `DevTools Client ${clientId.substring(0, 8)}`, 'hello');
+            const randomSecretKey = crypto.randomBytes(32).toString('hex');
+            await this.clientManager.createClientWithId(clientId, `DevTools Client ${clientId.substring(0, 8)}`, randomSecretKey);
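The suggested change relies on Node's built-in `crypto` module being in scope, and whether `server.js` already imports it is not visible in this diff. A minimal sketch of the key generation on its own follows, written in TypeScript for consistency with the other examples even though `server.js` is plain JavaScript. `generateClientSecret` is a hypothetical helper name.

```typescript
import { randomBytes } from 'node:crypto';

// Hypothetical helper: returns a random secret as a lowercase hex string.
// 32 random bytes yield 64 hex characters.
function generateClientSecret(byteLength: number = 32): string {
  return randomBytes(byteLength).toString('hex');
}
```

A generated secret still has to reach the client somehow, since the client can no longer assume a known default. How that exchange should work is a design question this sketch does not address.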

Copilot · 2025-07-17T23:22:04Z

eval-server/src/server.js

+          // Don't automatically start evaluations - wait for manual trigger
+          // this.processClientEvaluations(connection.clientId);


[nitpick] This commented-out line and its explanation suggest incomplete functionality. Either implement the automatic evaluation feature or remove the commented code and explanation to avoid confusion.

Suggested change

-          // Don't automatically start evaluations - wait for manual trigger
-          // this.processClientEvaluations(connection.clientId);
+          // Evaluations must be manually triggered.

olesho added 2 commits July 13, 2025 08:38

Initial setup of external evaluation server

47e4617

Evaluations server

66f366e

tysonthomas9 requested a review from Copilot July 14, 2025 15:52


olesho added 3 commits July 14, 2025 18:46

Refactoring and cleanup

2fd8b71

Updated evals server and client

a82b577

Tracing added to all evaluations

0a18ecd

tysonthomas9 reviewed Jul 17, 2025

eval-server/logs/combined.log (outdated, resolved)

tysonthomas9 reviewed Jul 17, 2025

eval-server/logs/error.log (outdated, resolved)

tysonthomas9 reviewed Jul 17, 2025

eval-server/logs/evaluations.jsonl (outdated, resolved)

tysonthomas9 requested a review from Copilot July 17, 2025 23:20

Copilot AI reviewed Jul 17, 2025


Clean up logs and revert the error retry logic

3ae8b2f

tysonthomas9 merged commit 543b097 into tysonthomas9:main Jul 18, 2025
